Biological sequence variant characterization

ABSTRACT

Short fixed length sub-sequences, defined as reference sub-sequences, are extracted from a collection of reference sequences, and an index is constructed showing which short fixed length reference sub-sequence occurs in which reference sequences. Short fixed length sub-sequences, the same length as the reference sub-sequences and defined as source sub-sequences, are extracted from a collection of source sequences derived from a sample for which the signature is to be determined, and the short fixed length source sub-sequences are compiled to determine the frequency of each within the collection. The presence or absence of source sub-sequences in combination with the index is used to infer the presence or absence of reference sequences from the reference collection.

CROSS REFERENCE TO RELATED APPLICATION

This application is a Continuation of U.S. patent application Ser. No.14/511,829, filed on Oct. 10, 2014, the disclosure of which isincorporated herein by reference in its entirety.

FIELD

The field relates to analysis of biological sequences and, moreparticularly, to characterization of biological sequences, for example,sequence variants.

REFERENCE TO SEQUENCE LISTING

The Sequence Listing in ASCII electronic format submitted electronicallyvia EFS-web is incorporated herein by reference in its entirety. TheSequence listing file, entitled YOR920140445US2 SEQ ASCII.txt, wascreated on Jul. 1, 2015 and is 860 bytes in size.

BACKGROUND

In general, a sequence is an ordered selection of symbols drawn from analphabet. The positions in a sequence of length n may be numbered from 0to n−1. Alphabets may include, but are not limited to, thedeoxyribonucleic acid (DNA) alphabet (A, C, G, T), the ribonucleic acid(RNA) alphabet (A, C, G, U) and the amino acid alphabet. Sequences thatcan be represented using such alphabets are called biological sequences(e.g., DNA sequences, RNA sequences, and protein sequences).

DNA and RNA may be double stranded. It is typically (but notnecessarily) the case that each source sequence in a set of sourcesequences (e.g. obtained from a DNA sequencing machine) is arbitrarilyderived from one strand or the other. As a consequence, to correctlyinterpret the set of source sequences, it is necessary to consider foreach source sequence, the sequence that would arise from reading thecomplementary strand. This other sequence is called the reversecomplement sequence. It is equivalent to the sequence obtained byreversing the original sequence and replacing each symbol with itscomplementary symbol. For types of sequence for which reversecomplementation is well defined, each symbol in the alphabet will have acomplementary symbol (e.g., for DNA, the complements are symmetric: Aand T are complementary as are C and G). For example, the reversecomplement of the DNA sequence AACGCTTCGA (SEQ ID NO. 1) is TCGAAGCGTT(SEQ ID NO. 2).

Genotypic characterization (genotyping) is an important method for theidentification of organisms and the determination of the relationshipsbetween them, often using DNA sequence data. Common genotypingtechniques include Multiple Locus Variable number tandem repeat Analysis(MLVA) and Multi-Locus Sequence Typing (MLST), which index geneticvariation at defined genomic loci (as further explained below) andcreate multi-locus profiles that are collected in a database. A genomeis the entire complement of DNA in a cell, while a locus is a specificposition in the genome (e.g., a region encoding a gene or a letter(“base”), coordinate). In the case of bacteria, the domain of life towhich these techniques are most commonly applied, the genome includesthe chromosome(s) and any phage or plasmid sequences contained withinthe cell.

Bacteria are haploid organisms, i.e., in the usual state there is only asingle copy of each chromosome per cell and therefore, in the majorityof cases, only a single copy of each locus (there are exceptions whenloci are duplicated on the same chromosome or on a plasmid). Incontrast, among non-haploid organisms, there is more than one copy ofeach chromosome per cell and therefore multiple copies of each locus,e.g., in humans, which are diploid, there are two copies of eachchromosome in each normal cell (excluding germ cells and the X/Ychromosomes).

SUMMARY

Embodiments of the invention provide techniques for characterization ofbiological sequence signatures.

For example, in one embodiment, a method for determining a signaturefrom biological sequence data comprises the following steps. Short fixedlength sub-sequences, defined as reference sub-sequences, are extractedfrom a collection of reference sequences, and an index is constructedshowing which short fixed length reference sub-sequence occurs in whichreference sequences. Short fixed length sub-sequences, the same lengthas the reference sub-sequences and defined as source sub-sequences, areextracted from a collection of source sequences derived from a samplefor which the signature is to be determined, and the short fixed lengthsource sub-sequences are compiled to determine the frequency of eachwithin the collection. The presence or absence of source sub-sequencesin combination with the index is used to infer the presence or absenceof reference sequences from the reference collection.

Advantageously, illustrative embodiments provide methodologies foranalysis of biological sequences including, but not limited to,identification and typing (e.g., genetic and/or proteomiccharacterization) of organisms.

These and other objects, features, and advantages of the presentinvention will become apparent from the following detailed descriptionof illustrative embodiments thereof, which is to be read in connectionwith the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows MLVA and MLST characterization techniques with which one ormore embodiments of the invention may be implemented.

FIG. 2 illustrates an example of next-generation sequence sampling withwhich one or more embodiments of the invention may be implemented.

FIG. 3 illustrates an example of a k-mer frequency distribution withwhich one or more embodiments of the invention may be implemented.

FIG. 4 illustrates a schematic representation of an MLVA amplicon inaccordance with a biological sequence characterization methodology ofone or more embodiments of the invention.

FIG. 5 illustrates a de Bruijn graph in accordance with a biologicalsequence characterization methodology of one or more embodiments of theinvention.

FIG. 6 illustrates sequence reads spanning an MLVA locus repeat regionin accordance with a biological sequence characterization methodology ofone or more embodiments of the invention.

FIG. 7 illustrates construction of a reference k-mer matrix inaccordance with a biological sequence characterization methodology ofone or more embodiments of the invention.

FIG. 8 illustrates identification of reference alleles in accordancewith a biological sequence characterization methodology of one or moreembodiments of the invention.

FIG. 9 illustrates a biological sequence characterization system inaccordance with one or more embodiments of the invention.

FIG. 10 illustrates a computing platform on which biological sequencecharacterization modules/steps are performed in accordance with one ormore embodiments of the invention.

DETAILED DESCRIPTION

MLVA and MLST techniques characterize individuals based on geneticvariation across multiple loci, which are dispersed around the genome,as will be further explained below in the context of FIG. 1. The MLVAtechnique involves characterization of the locus by inference of thenumber of tandem copies of a given DNA sequence repeat motif e.g., thetriplet “ATT.” Conventional MLVA techniques are described in well-knowntechnical literature, for example, see P. Keim et al., “Multiple-LocusVariable-Number Tandem Repeat Analysis Reveals Genetic RelationshipsWithin Bacillus anthracis,” J. Bacteriol., 182:2928-2936, 2000. The MLSTtechnique involves characterization of the locus by inference of theexact DNA sequence (called an allele) over a specific region (typicallyabout 400-500 bases in length). Conventional MLST techniques aredescribed in well-known technical literature, for example, see M. C. J.Maiden et al., “Multilocus Sequence Typing: A Portable Approach to theIdentification of Clones Within Populations of PathogenicMicroorganisms,” Proc. Natl. Acad. Sci., USA, 95:3140-3145, 1998. TheMLVA type is the list of integers representing the copy number at eachlocus. Note that the repeat motifs are not identical for each locus.Additionally, repeats may be inexact, e.g., nine of ten copies at alocus could be “ATT” and the remaining copy could be “GTT.” The existinglaboratory based technique may infer the copy number from the totallength of the tandem repeat.

In contrast, MLST is an exact matching technique where each distinct DNAsequence at a locus is assigned a distinct allele number (where allelesdescribe sequence variants at a given locus). Subsequently, eachdistinct combination of allele numbers, known as an “allelic profile,”is assigned a distinct sequence type (ST). Allele sequences and profilesfor the standard MLST schemes can be queried in the international MLSTdatabases (e.g., www.pubmlst.org and www.mlst.net). Like MLVA, MLSTschemes are generally species-specific and for some species there aremultiple international schemes.

MLVA and MLST techniques are useful for determining the geneticrelatedness of individuals and allow observation of epidemiologicaltrends.

FIG. 1 illustrates an MLVA technique and an MLST technique. Circles onthe left represent bacterial chromosomes and letters indicate therelative positions of loci included in the MLVA (upper or referencenumeral 100) and MLST (lower or reference numeral 110) schemes. NoteMLVA locus A is different from MLST locus A, MLVA locus B is differentfrom MLST locus B, etc. MLVA loci are defined by the number of copies ofa specified tandem repeat motif e.g., “ACT” as shown in underline forMLVA locus A (upper center). Repeat copy counts for all loci arecombined to define the MLVA type, as shown in the upper right box. MLSTloci are defined by their distinct DNA sequence and numberedsequentially in the (arbitrary) order of discovery/curation, e.g., asshown for MLST locus A (lower center—for convenience, underlined texthighlights variable positions). Distinct combinations of allele numbersare assigned distinct STs, numbered sequentially in the (arbitrary)order of discovery/curation, as shown in the lower right box.

The conventional way of performing MLVA is to use experimentallaboratory techniques comprising the polymerase chain reaction (PCR) togenerate multiple copies of (“amplify”) selected regions of the DNA froma purified microorganism sample (known as an “isolate”). The sequencecopies generated during the PCR reaction are known as “amplicons” andare subjected to electrophoresis to determine their lengths, or bytargeted sequencing (usually capillary-based Sanger sequencing). Thenumber of tandem repeat motif copies can be calculated from the lengthor the amplicon or sequence. The process is completed for each MLVAlocus as defined in the species-specific reference scheme.

The conventional way of performing MLST is to use PCR to amplify theloci of interest, followed by targeted sequencing (usuallycapillary-based Sanger sequencing at two times (2 x) coverage). However,it is realized here that these conventional techniques are becomingincreasingly redundant given the recent advent of next-generationsequencing (NGS) platforms, which enable DNA sequence data to becollected cheaply and at an unprecedented scale, without the need toselect and target specific sequence regions. For a description ofexample NGS technologies, see M. L. Metzker, “SequencingTechnologies—the Next Generation,” Nat. Rev. Genet., 11:31-46, 2010.

As generally applied, NGS technologies (also known as, e.g.,“high-throughput sequencing,” or for the newest generation, “thirdgeneration sequencing”) generate large numbers (usually millions) oftypically short, overlapping sequence reads at greater than two times(>2×) coverage (i.e., each base of the sample sequence is sampled >2×).FIG. 2 illustrates an example of next-generation sequence sampling. Thetrue sample sequence is represented in the shaded box 200. Sequencereads (e.g., 210 in FIG. 2) represent overlapping regions of the truesequence. Each base of the true sequence is sampled (sequenced) multipletimes, the extent of which is known as “coverage depth” (e.g., 220 inFIG. 2). The true coverage depth varies across the sample sequence andis often approximated as an average. Underlined text indicatessequencing errors where the base represented in the sequence read isdifferent from that in the relevant region of the true sequence. Thistype of error is known as a “substitution.”

Each true read represents a fragment of the true sequence in the sample,which may be DNA, RNA or rtDNA (recombinant DNA). In the case of a DNAsequence, the true sequence may represent the complete genome of asingle individual, a composite of genomes of multiple individuals orspecific amplified region(s) of DNA from one or multiple individuals.When the technique is applied to a purified sample representing thecomplete genome of a single individual, it may also be called“whole-genome sequencing.” In this case, the sequencing protocols aregenerally designed such that sequence reads represent regions that areapproximately randomly distributed across the genome. In the case ofbacterial genomes, the per base coverage is usually in the region of10-100×.

Regardless of the type of sample, there may be known relationshipswithin the set of sequence reads e.g., some sequencing equipment canproduce data in which pairs of sequences are read from the ends of asingle DNA fragment, or the sequencing may have been done in such a wayas to ensure that all the sequences come from a single strand ofdouble-stranded DNA.

In silico derivation of existing assays from NGS data is an importantarea of research and development, and is essential to enablecompatibility between data sources. For MLVA in particular, no NGS basedmethod exists to reliably derive the typing information (repeat motifnumbers/amplicon lengths at each locus of interest).

In contrast, there have been multiple strategies proposed to derive MLSTinformation from NGS data. Previous approaches fall into two categories:mapping based approaches and assembly based approaches. Mapping basedapproaches proceed by mapping or aligning the comparatively short NGSreads to a set of reference sequences and analyzing the results todetermine which allele is present at each locus. Assembly basedapproaches reconstruct the underlying genome by piecing togetheroverlapping reads, then analyze the results to determine which allelesare present (e.g., by the Basic Local Alignment Search Tool (BLAST) andsequence alignment). The existing approaches perform general analysisprocedures (mapping or assembly) then extract the sequence typeinformation from the results of those general procedures.

Illustrative embodiments of the invention provide computational methodsrelated to the identification of signatures from comparatively short,overlapping sequences obtained from a biological sample. The source datafrom which a signature is to be deduced are comprised of one or moresequences of, for example, DNA bases which are derived by somemeasurement technique from the biological sample. The input sequencesmay be individual sequence reads from a sequencing technology, or theymay be the result of processing raw input sequences to combine orotherwise transform them. In the illustrative embodiments describedhere, a primary assumption is that each source sequence corresponds to acontiguous sequence in the biological sample.

The methods of the illustrative embodiments described here are based onk-mer techniques which differ from mapping based techniques. Mappingbased methods proceed by first determining which position on a referencesequence best matches each sequence in the source data. By contrast,k-mer based methods generally proceed by collecting all the k-mers outof the source data, and then performing some kind of analysis on them,without explicitly matching the source sequences. Some k-mer methodsextract k-mers out of reference sequences and use them to classifyindividual sequences in the source data, but like the mappingapproaches, these tend to operate on one source sequence at a time.

K-Mer Preparation

Illustrative embodiments provide methods for extracting two generalkinds of genomic feature from a set of source sequences: (1) primerdelimited tandem repeat sequences; and (2) exact alleles. Both featuretypes are extracted by use of k-mer techniques, where a k-mer is a klength substring from a sequence of length greater than or equal to k.The following processing steps are common to the extraction of bothfeatures.

The first part of the methods described for determining genomicsignatures from sequence data extract the k-mers from the set of sourcesequences. For each sequence in the set, the k-mers at all the positionsare determined. If the source sequences come from a double strandedsource (such as whole-genome or other NGS DNA sequencing), then thereverse complement of each k-mer should also be considered. The k-mersare collated so that a determination can be made as to the number oftimes each k-mer occurred across the whole collection.

For example, given the DNA sequence AACGCTTACGA (SEQ ID NO. 3), andtaking k=3, the list (in order of position) of k-mers that may beextracted is AAC, ACG, CGC, GCT, CTT, TTA, TAC, ACG, CGA, and thecollated collection, including reverse complements is: <AAC,1>, <AAG,1>,<ACG,2>, <AGC,1>, <CGA,1>, <CGC, 1>, <CGT,2>, <CTT,1>, <GCG,1>, <GCT,1>,<GTA,1>, <GTT,1>, <TAA,1>, <TAC,1>, <TCG,1>, <TTA,1>.

It is typically (but not necessarily) the case that the source sequencesmay contain errors. Before the collated k-mers can be used to determinea genomic signature, k-mers arising from errors should be eliminated.This may be done in various ways, but two specific ways are describedfor illustrative purposes.

When DNA sequence data is generated by an NGS instrument, it is usualthat enough sequences are determined that each position in theunderlying sample sequence is covered by multiple fragments asillustrated in FIG. 2. As a consequence, k-mers will appear with afrequency proportional to the coverage depth.

Errors in the source sequences obtained from such instruments areapproximately uniform-randomly distributed, so “false” k-mers that covererrors in source sequences will appear in the collated k-mers with lowerassociated frequency than “true” k-mers. It has been shown that a lowfrequency, assumed false k-mer can be corrected to a similar true k-mer(assumed so because of its higher frequency), see, e.g., P. Medvedev etal., “Error correction of high-throughput sequencing datasets withnon-uniform coverage,” Bioinformatics 27 (13):137-141, 2011.Alternatively, all low frequency k-mers can simply be discarded, whichis more robust provided the coverage depth is sufficient.

Determining the threshold below which a k-mer should be consideredfalse, or above which it should be considered true can be done byexamining the distribution of k-mer frequencies. That is, the number oftimes each k-mer frequency appears may be determined. FIG. 3 shows atypical distribution (denoted by reference numeral 300) of k-merfrequencies. As can clearly be seen, the overall distribution is thecomposition of two component distributions. The left-hand (lowfrequency) component shows the distribution of false k-mers, and theright-hand component shows the distribution of true k-mers. Thethreshold should be set to minimize the number of false k-mers retained,and minimize the number of true k-mers discarded.

A simple but effective way of doing this is to find the first localminimum of the combined distribution (see arrow pointing to localminimum at distribution cross-over in FIG. 3). There may be some noisein the distribution, so the robustness of this technique may be improvedby applying median filtering to the combined distribution.

Alternatively, a more precise estimate may be determined by performingparameter estimation to fit an analytic model to the empiricaldistribution. An example of such a model would use an exponentialdistribution for the false k-mers and a Poisson distribution for thetrue k-mers, and a parameter estimation method such as Maximum Entropyor Levenberg-Marquardt. By doing so, a more precise choice can be madewith an estimated number of true k-mers discarded and an estimatednumber of false k-mers remaining. Moreover, it can also be estimated howwell the model fits the actual distribution, and if there are problemswith the source data, they may be identified by a poor correspondencebetween the analytic and empirical models.

A second, and complementary approach to identifying errors relies oninterpreting the collated k-mers as a de Bruijn graph. Under such aninterpretation, many of the false k-mers lie on short paths that branchoff “true” paths. This method may be used to eliminate most errors thatremain after discarding low frequency k-mers. One illustrative methodfor creating a de Bruijn graph is described in T. C. Conway et al.,“Succinct Data Structures for Assembling Large Genomes,” Bioinformatics,27:479-486, 2011. Another method is described in P. E. Compeau et al.,“How to Apply de Bruijn Graphs to Genome Assembly,” NationalBiotechnology, 29:987-991, 2011.

After performing these or other error correction or elimination methods,the result is a set of k-mers that form a representation of theunderlying sample sequence given by the set of source sequences.

As mentioned above, FIG. 3 illustrates an example of a k-mer frequencydistribution. “K-mer abundance” represents the number of copies of eachdistinct k-mer, while “#k-mers” is the number of distinct k-mers thatappear at the corresponding copy number. The distribution comprises thatof the false k-mers, resulting from sequencing error (low frequencyonly) and that of the true k-mers, representing regions of the truesample sequence. The local frequency minimum at the point of cross-overbetween the distributions is marked.

Fast and Accurate MLVA from Short Reads

A methodology for fast and accurate MLVA from short reads will now bedescribed in accordance with one or more illustrative embodiments. Thisfirst method concerns the characterization of primer delimited tandemsequence repeats, e.g., as forms the basis of genotyping techniques suchas MLVA.

In such techniques, in addition to the source sequences, we have as aninput the primer scheme. In the case of MLVA schemes, which aregenerally species-specific, each scheme is comprised of a series oflocus definitions. Each locus is composed of the name of the locus, thevariable length repeat motif and a pair of known primer sequences thatflank (at some defined distance) the repeat region. The upstream primeris called the forward primer, and the downstream one is called thereverse primer, because under the traditional PCR-based method, they aredesigned to bind to the forward and reverse complementary DNA strands,respectively. The whole sequence from the start of the forward primer tothe end of the reverse primer is equivalent to the PCR amplicon asillustrated in FIG. 4. That is, FIG. 4 is a schematic representation ofan MLVA amplicon 400. The amplicon 400 spans the MLVA locus of interest(shown in the double stranded form where the upper line 410 representsthe forward DNA strand and the lower line 420 represents the reverse DNAstrand). The PCR primer binding sites (forward primer 430 and reverseprimer 440) lie at the outer-most edges of the amplicon at a knowndistance from the tandem repeat motif region 450. The distances betweenthe repeat regions (460 and 470) and each of the primers are specific tothe MLVA locus in question such that the total number of repeat motifcopies can be calculated when the total amplicon length, repeat flankingregion lengths and repeat motif lengths are known.

In order to derive a characterization of the tandem repeat loci ofinterest from the NGS data, the methodology first processes the sourcesequences into k-mers as described above in k-mer preparationdescription. For each locus, the methodology then constructs a de Bruijngraph or other type of overlap graph starting with the sequence matchingthat of the first k-mer in the forward primer. Each k-mer represents anedge in the graph, and the nodes correspond to the k−1 length prefixesand suffixes. The methodology finds all possible paths through the graphthat sequentially traverse the full set of k-mers matching those in thespecified forward primer sequence, and continues to extend the path(s)until it reaches the sequence matching the reverse primer (as will befurther described below in the context of FIG. 5).

If either or both of the primer sequences cannot be exactly matched, themethodology can chose to find low Hamming or Levinshtein distancematches. If there are no low Hamming or Levinshtein distance matches toeither or both primers, or there is no path in the graph between them,then the sample is said to have a null allele—the locus does not occurin the sample.

In some cases, there is a simple non-branched path through the graphfrom the forward to the reverse primer sequences (or low Hamming orLevinshtein distance matches). This will occur if the repeat region isshorter than k, or the repeats are inexact. In these cases, there is asingle possible sequence so the repeat region can be characterizeddirectly from the graph. In the case of MLVA, the goal is to infer thenumber of repeat copies in the repeat region. When there is a simplenon-branched path through the de Bruijn graph, or other overlap graph,the methodology can either scan the amplicon and count the number ofinstances of the repeat motif, or it can use the length of the flanks,the total length of the amplicon and the length of the repeat motif tocompute the number of instances under the formula shown below or asuitable equivalent:

$\frac{A - \left( {P_{1} + P_{2} + F_{1} + F_{2}} \right)}{r} = n$

Where n is the repeat motif copy number; A is the amplicon length; P₁and P₂ are the lengths of the forward and reverse primer binding sites,respectively; F₁ and F₂ are the lengths of the upstream and downstreamrepeat flanking regions, respectively; r is the repeat motif length.(All lengths in bases.)

In many cases, however, the repeat region forms a cycle in the de Bruijngraph, or other overlap graph, and it is not possible to directlycompute the true path, because it is not clear how many times round thecycle the true path runs (see FIG. 5). In these cases, the methodologyreprocesses the source reads looking for reads that completely span therepeat region (see FIG. 6). If the repeat region is longer than the readlength, there is no direct evidence for the length of the repeat. Aspanning read must have its first and last k-mers outside the cycle, butmust pass through the cycle.

If the methodology is computing the copy number from amplicon length,the distribution of the number of times through a cycle each spanningread passes is defined. If the methodology is directly counting repeatcopies, then the methodology can scan the read for this, and define thedistribution. The reason for defining the distribution is that the DNApolymerase utilized during the NGS protocol sometimes slips over repeatsso individual reads can sometimes yield different copy numbers (see FIG.6). As such, a statistical test should be applied in order to estimatethe true repeat copy number.

FIG. 5 illustrates an example of a de bruijn graph with cycle. The graph500 represents the starting, ending and central regions of the path fromthe beginning to the end of the MLVA locus shown in FIGS. 4 and 6. Thecycle representing the tandem repeat sequence region is shown.

FIG. 6 illustrates examples of sequence reads spanning the MLVA locusrepeat region. The true sample sequence is shown (top) as in FIG. 4(shown in the double stranded form where the upper line 410 representsthe forward DNA strand and the lower line 420 represents the reverse DNAstrand) with primer binding sites 430 and 440, repeat region 450, andrepeat flanking regions 460 and 470. Example sequence reads n (610), n−1(620), and n+1 (630) spanning the repeat region are shown below the truesample sequence (read length=27 bases). The first and last k-mer of eachread must fall outside of the repeat region. In this example, k=3 forillustrative purposes only (in practice larger values of k are used toprevent noise/crosstalk with reads representing other regions of thesample sequence). In some cases, the reads do not contain the correctnumber of repeat copies due to slippage of the polymerase enzyme duringthe sequencing reaction.

Accordingly, in one or more illustrative embodiments, a computationalmethod for determining a signature from a biological sequence based onthe length of repetitive sub-sequences and/or the number of copies ofsub-sequence repeat motifs, from biological sequence data comprises thefollowing steps. The biological sequence regions whose length is to bedetermined and/or which contains the repeat motif copies to be counted,are identified by flanking reference sequences (pre- and post-) whichserve as markers, and where there may be extra sequence “letters”between the marker sequence and the repetitive sequence to be measured.Short fixed length sequences (source sub-sequences) are extracted from acollection of source sequences derived from a sample for which thesignature is to be determined, and the extracted short fixed lengthsource sub-sequences are compiled to determine the frequency of eachwithin the collection. Frequency information and/or the overlaprelationships between the short fixed length source sub-sequences may beused to eliminate short fixed length source sub-sequences that may bedue to measurement error. The overlaps between the short fixed lengthsource sub-sequences from the preceding step can be used to find a chainof overlaps from the sub-sequence(s) equivalent to the pre-flankingreference marker sequence to the sub-sequence(s) equivalent to thepost-flanking reference marker sequence. Where the chain could containmultiple instances of any of the short fixed length sourcesub-sequences, that is a cycle, the sequences from the collectionderived from the sample are examined to find any sequences that span thecycle, and: (i) use the lengths of the spanning sequences to determinethe length of the cycle; and/or (ii) count the number of repeat motifcopies within each spanning sequence. In sub-step (i) above, the lengthof the repetitive sequence region is determined by taking into accountthe distance between the reference marker sequence equivalents from thetwo preceding steps, and the number of extra sequence letters betweenthe reference marker sequence equivalents and the repetitive sequenceregion. Furthermore, in one or more illustrative embodiments, thebiological sequences are classified as, but not limited to, any of: DNA,RNA, and proteins. The sequence repeat regions may be known as, but arenot limited to, tandem repeats, tandem repeat motif regions, andmicro-satellites. The reference marker sequences may be known asprimers. The short fixed length sub-sequences may be known as k-mers.The sample from which the source sub-sequences are derived may be oneof, but are not limited to, whole-genome sequences, high-throughputsequences, whole-exome sequences, transcriptomes, and metagenomes. Thechain of sub-sequence overlaps may take the form of a de Bruijn graph.

Fast and Efficient Method for MLST

A methodology for fast and efficient MLST will now be described inaccordance with one or more illustrative embodiments. This second methodconcerns the identification of specific sequence variants within abiological sample. The identification of genomic allelic variants isrelevant to genotyping schemes such as MLST, but is also applicable toany scheme for which a reference database of known alleles at one ormore loci of interest is available. Thus, the method described herein isapplicable to schemes other than MLST.

In one illustrative embodiment, the method is an alignment-free methodbased on k-mers and utilizes the k-mer preparation techniques describedabove. The method first provides for the collection of reference allelesfor each locus to be decomposed into reference k-mers and indexed toconstruct a matrix describing the alleles at each locus which containthat k-mer, an example of which is shown in FIG. 7, although othervariations may also be applied. An important feature is that it is anindex which shows for each k-mer in the collection of reference alleles,in which alleles, and possibly in which position within the allele, thatk-mer occurs. In construction of this index, the method enables thedetermination of specific sequence variants of k≧1 length, and as suchdiffers from that describing the use of single k-mer matching todetermine the presence of a given sequence motif of k=1 length e.g., asdescribed for the derivation of spoligotyping information from NGS data.

Thus, FIG. 7 shows an example method 700 for construction of thereference k-mer matrix. In step 710, reference alleles are collected foreach locus. In step 720, reference alleles at each locus are decomposedinto distinct k-mers (e.g., k=3 for convenience here). In step 730, amatrix describing the presence and absence of each distinct k-mer ineach reference allele sequence at each locus is generated. In thisexample, “1” indicates the presence of a k-mer, while “0” indicates itsabsence. For convenience, only three reference alleles at two distinctloci are shown. Steps of method 700 will now be described in furtherdetail.

To analyze a collection of source sequences, each read (or read pair, inthe case of paired end sequencing) is decomposed into k-mers asdescribed above (k-mer preparation), and an orientation for the read (orpair) is chosen by examining whether there are more forward direction orreverse-complement k-mers in the read which are also present in thereference collection. For each k-mer from the reads which also occurs inthe reference collection, the number of instances is counted. Once allthe reads have been processed, we have a sample-specific k-mer frequencyfor each reference k-mer.

To determine which allele is present in the true sample sequence foreach locus, the method first discards any reference allele for which atleast one of its corresponding reference k-mers was identified in thesample at a frequency below a user specified minimum threshold (e.g.,5). Any such allele can be said to be insufficiently supported by thek-mer evidence. The remaining alleles are evaluated with a statisticalhypothesis test against a null model that the frequency of k-mers is dueto noise/crosstalk with sequence reads representing other regions of thesample sequence (i.e., not the locus of interest). Reference allelesthat are judged insignificant are discarded.

Alternatively, if the source sequences were processed and putative errorsequences removed, as described above in k-mer preparation, the list oftrue k-mers can be directly compared against the list of referencek-mers, and the reference index can be used as above to determine whichalleles are supported by the set of source sequences. Given that theglobal processing of the source k-mers is used to determine which aretrue and which are false, the minimum depth threshold used above is notneeded as part of the allele calling process.

All remaining alleles for a locus are judged either to be present in thetrue sample sequences or to be unresolvable. In practice, if the samplewas thought to represent the genome of a single purified haploidorganism (e.g., a bacterial isolate) but there are multiple allelespresent, it usually means that there was contamination (i.e., biologicalmaterial from one or more individual organisms additional to theindividual of interest, was contained within the sample). In such cases,either the user can opt to pick the allele with the lowest statisticalp-value (probability of making a type I statistical error), or reportall candidates. As such, this method can be used to detect contaminationamong supposedly purified haploid organism samples.

By leveraging other information about the sample sequences, e.g., pairedsequence read information or Hi-C information if available, the methodcan be extended to: (1) enable resolution of alleles at heterozygousloci in samples representing diploid organisms; and/or (2) separatealleles originating from different individuals among meta-genomicsamples.

FIG. 8 illustrates a method for identification of reference alleles forwhich all k-mers are present in the sample over the minimum cut-offfrequency threshold. In step 810, a k-mer set is derived from sourcesequences. In step 820, k-mers equivalent to those in each referenceallele at each locus are tallied. In step 830, reference alleles forwhich all k-mers are present over a threshold frequency are identified.

That is, source sequence k-mers are compared to those in the referencek-mer matrix to determine the frequency of each reference k-mer in thesource sequences. A second matrix describing the frequency of k-mers inthe sample matching those in each reference allele at each locus isgenerated, and may take the form of that shown at the bottom of FIG. 8.Source sequence k-mer frequencies over the minimum threshold cut-off areshown in last line of the matrix in shading (frequency >5 in thisexample). Alleles for which all reference k-mers are present in thesource sequences over the threshold frequency are identified andsubjected to statistical assessment.

Furthermore, it is to be appreciated that if there are no candidateallele(s) for a locus, typically it will be either because there wasinsufficient data (e.g., the sequence coverage depth was too low), thesample contains a novel allele that was not included in the referencedatabase, or the sample does not contain that locus, e.g., due to agenomic deletion. To determine the novel allele, the method uses aguided assembly.

The guided assembly is performed by selecting only those sample readscontaining at least one reference k-mer in common with any existingreference allele. The full set of k-mers resulting from the k-merdecomposition of the selected source reads is considered. Low frequencyk-mers are discarded since they are likely to represent sequencingerrors as described above (see k-mer preparation). The remaining set ofk-mers is treated as an order k de Bruijn graph, or other overlap graph.A unique path is then sought from a known initial reference k-mer (i.e.,a k-mer starting at position 0 in any of the known reference alleles) toa known terminal reference k-mer (i.e., a k-mer present at the lastposition in any known reference allele). If either a known initial orterminal k-mer or both are not present in the graph, then a low Hammingor Levenshtein distance neighbor of a known reference k-mer is sought.This logic is applied since in many cases reference loci are chosen suchthat they start and end in regions of the genome that are conservedacross individuals within a species (genus or sub-species).

If multiple paths through the de Bruijn graph or other overlap graph arepresent, then the alternatives are reported as candidate allelesequences. The candidate allele sequences are decomposed into candidatereference k-mers and scored in the same manner as known referencealleles so that a statistical test of significance can be applied asdescribed above.

The method we describe is fast because extracting and manipulatingk-mers (which are stored in single integer words—e.g., uint64_t) is veryefficient. It is also fast because it does not compute alignments forindividual sequence reads. At small values of k (e.g., <15), this wouldbe a problem because of coincidental co-occurrence of k-mers acrosssequence reads representing different parts of the genome (i.e., not theloci of interest), but at the larger values that we use (e.g., 30), thedegree of specificity is high and crosstalk is not a problem.

In comparison to non-targeted assembly techniques which seek to assemblethe full complement of sequences in a sample (e.g., an entire genome),the targeted assembly is fast because it performs very fast cursoryprocessing to select relevant reads, and then performs the moreintensive (but still fast) assembly just on the relevant ones. For asingle bacterial MLST locus, this results in a reduction of assemblysequence length by approximately four orders of magnitude.

Accordingly, in one or more illustrative embodiments, a computationalmethod for determining a signature from biological sequence datacomprises the following steps. Short fixed length sub-sequences(reference sub-sequences) are extracted from a collection of referencesequences, which may be composed of distinct subsets of referencesequences, and an index is constructed showing which short fixed lengthreference sub-sequence occurs in which reference sequences and, ifdesired, at which position. Short fixed length sub-sequences (the samelength as the reference sub-sequences in the above step, and referred toas source sub-sequences) are extracted from a collection of sourcesequences derived from a sample for which the signature is to bedetermined, and those short fixed length source sub-sequences arecompiled to determine the frequency of each within the collection. Thefrequency information and/or the overlap relationships between the shortfixed length source sub-sequences may be used to eliminate short fixedlength source sub-sequences that may be due to measurement error. Thepresence or absence of short fixed length source sub-sequences from theresults of the preceding step in combination with the constructed indexare used to infer the presence or absence of reference sequences withrespect to the reference collection. Where a member of a subset of thereference collection is expected but not found to be present in thepreceding step, the initial and final short fixed length referencesub-sequences of the members of the subset of reference sequence areused, along with the overlaps between the short fixed length sourcesub-sequences, to derive a novel member of that subset of the referencesequences. Furthermore, in one or more illustrative embodiments, thebiological sequences are classified as, but not limited to, any of: DNA,RNA, and proteins. The short fixed length sub-sequences may be known ask-mers. The sample from which the source sub-sequences are derived maybe one of, but are not limited to, next-generation sequences,whole-genome sequences, whole-exome sequences, transcriptomes, andmetageomes. The reference sequence collection may include, but is notlimited to, allelic (geno)typing schemes, e.g., multi-locus sequencetyping or whole-genome sequence typing. Sub-sequence overlaps from whichnovel sequence variants are identified may take the form of a de Bruijngraph.

System and Computing Platform

FIG. 9 illustrates a biological sequence characterization system 900. Asshown, system 900 includes two main modules: module 910 which isconfigured to perform MLVA using k-mer sequences; and module 920 whichis configured to perform MLST using k-mer sequences. More particularly,module 910 is configured to extract a biological signature, in the formof primer delimited tandem repeat sequences, from input biologicalsequence data using one or more of the techniques described above in thecontext of FIGS. 4-6. Further, module 920 is configured to extract abiological signature, in the form of exact alleles, from inputbiological sequence data using one or more of the techniques describedabove in the context of FIGS. 7 and 8. Variations of these modules maybe realized as explained above in the context of the various embodimentspresented. Also, while system 900 is illustrated with modules 910 and920 collocated in the same system, it is to be appreciated that eachmodule, as well as parts thereof, can be implemented in a distributedmanner across multiple systems and/or sub-systems.

Embodiments of the present invention may be a system, a method, and/or acomputer program product. The computer program product may include acomputer readable storage medium (or media) having computer readableprogram instructions thereon for causing a processor to carry outaspects of the present invention.

Accordingly, the architecture shown in FIG. 10 may be used to implementthe various components/steps shown and described above in the context ofFIGS. 1-9.

The computer readable storage medium can be a tangible device that canretain and store instructions for use by an instruction executiondevice. The computer readable storage medium may be, for example, but isnot limited to, an electronic storage device, a magnetic storage device,an optical storage device, an electromagnetic storage device, asemiconductor storage device, or any suitable combination of theforegoing. A non-exhaustive list of more specific examples of thecomputer readable storage medium includes the following: a portablecomputer diskette, a hard disk, a random access memory (RAM), aread-only memory (ROM), an erasable programmable read-only memory (EPROMor Flash memory), a static random access memory (SRAM), a portablecompact disc read-only memory (CD-ROM), a digital versatile disk (DVD),a memory stick, a floppy disk, a mechanically encoded device such aspunch-cards or raised structures in a groove having instructionsrecorded thereon, and any suitable combination of the foregoing. Acomputer readable storage medium, as used herein, is not to be construedas being transitory signals per se, such as radio waves or other freelypropagating electromagnetic waves, electromagnetic waves propagatingthrough a waveguide or other transmission media (e.g., light pulsespassing through a fiber-optic cable), or electrical signals transmittedthrough a wire.

Computer readable program instructions described herein can bedownloaded to respective computing/processing devices from a computerreadable storage medium or to an external computer or external storagedevice via a network, for example, the Internet, a local area network, awide area network and/or a wireless network. The network may comprisecopper transmission cables, optical transmission fibers, wirelesstransmission, routers, firewalls, switches, gateway computers and/oredge servers. A network adapter card or network interface in eachcomputing/processing device receives computer readable programinstructions from the network and forwards the computer readable programinstructions for storage in a computer readable storage medium withinthe respective computing/processing device.

Computer readable program instructions for carrying out operations ofthe present invention may be assembler instructions,instruction-set-architecture (ISA) instructions, machine instructions,machine dependent instructions, microcode, firmware instructions,state-setting data, or either source code or object code written in anycombination of one or more programming languages, including an objectoriented programming language such as Smalltalk, C++ or the like, andconventional procedural programming languages, such as the “C”programming language or similar programming languages. The computerreadable program instructions may execute entirely on the user'scomputer, partly on the user's computer, as a stand-alone softwarepackage, partly on the user's computer and partly on a remote computeror entirely on the remote computer or server. In the latter scenario,the remote computer may be connected to the user's computer through anytype of network, including a local area network (LAN) or a wide areanetwork (WAN), or the connection may be made to an external computer(for example, through the Internet using an Internet Service Provider).In some embodiments, electronic circuitry including, for example,programmable logic circuitry, field-programmable gate arrays (FPGA), orprogrammable logic arrays (PLA) may execute the computer readableprogram instructions by utilizing state information of the computerreadable program instructions to personalize the electronic circuitry,in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference toflowchart illustrations and/or block diagrams of methods, apparatus(systems), and computer program products according to embodiments of theinvention. It will be understood that each block of the flowchartillustrations and/or block diagrams, and combinations of blocks in theflowchart illustrations and/or block diagrams, can be implemented bycomputer readable program instructions.

These computer readable program instructions may be provided to aprocessor of a general purpose computer, special purpose computer, orother programmable data processing apparatus to produce a machine, suchthat the instructions, which execute via the processor of the computeror other programmable data processing apparatus, create means forimplementing the functions/acts specified in the flowchart and/or blockdiagram block or blocks. These computer readable program instructionsmay also be stored in a computer readable storage medium that can directa computer, a programmable data processing apparatus, and/or otherdevices to function in a particular manner, such that the computerreadable storage medium having instructions stored therein comprises anarticle of manufacture including instructions which implement aspects ofthe function/act specified in the flowchart and/or block diagram blockor blocks.

The computer readable program instructions may also be loaded onto acomputer, other programmable data processing apparatus, or other deviceto cause a series of operational steps to be performed on the computer,other programmable apparatus or other device to produce a computerimplemented process, such that the instructions which execute on thecomputer, other programmable apparatus, or other device implement thefunctions/acts specified in the flowchart and/or block diagram block orblocks.

The flowchart and block diagrams in the Figures illustrate thearchitecture, functionality, and operation of possible implementationsof systems, methods, and computer program products according to variousembodiments of the present invention. In this regard, each block in theflowchart or block diagrams may represent a module, segment, or portionof instructions, which comprises one or more executable instructions forimplementing the specified logical function(s). In some alternativeimplementations, the functions noted in the block may occur out of theorder noted in the Figures. For example, two blocks shown in successionmay, in fact, be executed substantially concurrently, or the blocks maysometimes be executed in the reverse order, depending upon thefunctionality involved. It will also be noted that each block of theblock diagrams and/or flowchart illustration, and combinations of blocksin the block diagrams and/or flowchart illustration, can be implementedby special purpose hardware-based systems that perform the specifiedfunctions or acts or carry out combinations of special purpose hardwareand computer instructions.

One or more embodiments can make use of software running on ageneral-purpose computer or workstation. With reference to FIG. 10, in acomputing node 1010 there is a computer system/server 1012, which isoperational with numerous other general purpose or special purposecomputing system environments or configurations. Examples of well-knowncomputing systems, environments, and/or configurations that may besuitable for use with computer system/server 1012 include, but are notlimited to, personal computer systems, server computer systems, thinclients, thick clients, handheld or laptop devices, multiprocessorsystems, microprocessor-based systems, set top boxes, programmableconsumer electronics, network PCs, minicomputer systems, mainframecomputer systems, and distributed cloud computing environments thatinclude any of the above systems or devices, and the like.

Computer system/server 1012 may be described in the general context ofcomputer system executable instructions, such as program modules, beingexecuted by a computer system. Generally, program modules may includeroutines, programs, objects, components, logic, data structures, and soon that perform particular tasks or implement particular abstract datatypes. Computer system/server 1012 may be practiced in distributed cloudcomputing environments where tasks are performed by remote processingdevices that are linked through a communications network. In adistributed cloud computing environment, program modules may be locatedin both local and remote computer system storage media including memorystorage devices.

As shown in FIG. 10, computer system/server 1012 in computing node 1010is shown in the form of a general-purpose computing device. Thecomponents of computer system/server 1012 may include, but are notlimited to, one or more processors or processing units 1016, a systemmemory 1028, and a bus 1018 that couples various system componentsincluding system memory 1028 to processor 1016.

The bus 1018 represents one or more of any of several types of busstructures, including a memory bus or memory controller, a peripheralbus, an accelerated graphics port, and a processor or local bus usingany of a variety of bus architectures. By way of example, and notlimitation, such architectures include Industry Standard Architecture(ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA)bus, Video Electronics Standards Association (VESA) local bus, andPeripheral Component Interconnects (PCI) bus.

The computer system/server 1012 typically includes a variety of computersystem readable media. Such media may be any available media that isaccessible by computer system/server 1012, and it includes both volatileand non-volatile media, removable and non-removable media.

The system memory 1028 can include computer system readable media in theform of volatile memory, such as random access memory (RAM) 1030 and/orcache memory 1032. The computer system/server 1012 may further includeother removable/non-removable, volatile/nonvolatile computer systemstorage media. By way of example only, storage system 1034 can beprovided for reading from and writing to a non-removable, non-volatilemagnetic media (not shown and typically called a “hard drive”). Althoughnot shown, a magnetic disk drive for reading from and writing to aremovable, non-volatile magnetic disk (e.g., a “floppy disk”), and anoptical disk drive for reading from or writing to a removable,non-volatile optical disk such as a CD-ROM, DVD-ROM or other opticalmedia can be provided. In such instances, each can be connected to thebus 1018 by one or more data media interfaces. As depicted and describedherein, the memory 1028 may include at least one program product havinga set (e.g., at least one) of program modules that are configured tocarry out the functions of embodiments of the invention.

A program/utility 1040, having a set (at least one) of program modules1042, may be stored in memory 1028 by way of example, and notlimitation, as well as an operating system, one or more applicationprograms, other program modules, and program data. Each of the operatingsystem, one or more application programs, other program modules, andprogram data or some combination thereof, may include an implementationof a networking environment. Program modules 1042 generally carry outthe functions and/or methodologies of embodiments of the invention asdescribed herein.

Computer system/server 1012 may also communicate with one or moreexternal devices 1014 such as a keyboard, a pointing device, a display1024, etc., one or more devices that enable a user to interact withcomputer system/server 1012, and/or any devices (e.g., network card,modem, etc.) that enable computer system/server 1012 to communicate withone or more other computing devices. Such communication can occur viaInput/Output (I/O) interfaces 1022. Still yet, computer system/server1012 can communicate with one or more networks such as a local areanetwork (LAN), a general wide area network (WAN), and/or a publicnetwork (e.g., the Internet) via network adapter 1020. As depicted,network adapter 1020 communicates with the other components of computersystem/server 1012 via bus 1018. It should be understood that althoughnot shown, other hardware and/or software components could be used inconjunction with computer system/server 1012. Examples include, but arenot limited to, microcode, device drivers, redundant processing units,external disk drive arrays, RAID systems, tape drives, and data archivalstorage systems, etc.

Although illustrative embodiments of the present invention have beendescribed herein with reference to the accompanying drawings, it is tobe understood that the invention is not limited to those preciseembodiments, and that various other changes and modifications may bemade by one skilled in the art without departing from the scope orspirit of the invention.

What is claimed is:
 1. A method for determining a signature frombiological sequence data, comprising: extracting short fixed lengthsub-sequences, defined as reference sub-sequences, from a collection ofreference sequences, and constructing an index showing which short fixedlength reference sub-sequence occurs in which reference sequences;extracting short fixed length sub-sequences, the same length as thereference sub-sequences and defined as source sub-sequences, from acollection of source sequences derived from a sample for which thesignature is to be determined, and compiling the short fixed lengthsource sub-sequences to determine the frequency of each within thecollection; and using the presence or absence of source sub-sequences incombination with the index to infer the presence or absence of referencesequences from the reference collection; wherein one or more of theabove steps are performed in accordance with a processor and a memory.2. The method of claim 1, wherein when a member of a subset of thereference collection is expected but not found to be present, theinitial and final short fixed length reference sub-sequences of themembers of the subset of the reference sequences are used, along withoverlaps between the short fixed length source sub-sequences, to derivea novel member of that subset of the reference sequences.
 3. The methodof claim 1, further comprising using at least one of the frequency ofthe extracted short fixed length source sub-sequences and overlaprelationships between the short fixed length source sub-sequences toeliminate short fixed length source sub-sequences that are due tomeasurement error.
 4. The method of claim 1, wherein the biologicalsequence data is classified as at least one of DNA, RNA, and proteins.5. The method of claim 1, wherein the short fixed length sub-sequencescomprise k-mers.
 6. The method of claim 1, wherein the sample from whichthe source sub-sequences are derived comprises one of next-generationsequences, high-throughput sequences, whole-genome sequences,whole-exome sequences, transcriptomes, and metagenomes.
 7. The method ofclaim 1, wherein the reference sequence collection comprises an allelicgenotyping scheme.
 8. The method of claim 7, wherein the allelicgenotyping scheme comprises one of a multi-locus sequence typing andwhole-genome sequence typing.
 9. The method of claim 1, whereinsub-sequence overlaps from which novel sequence variants are identifiedare in the form of a de Bruijn graph.
 10. A method comprising: collatingshort fixed length sequences from each position in a referencesubsequence to produce an index; collating short fixed length sequencesfrom experimentally derived sequences, producing a list of short fixedlength sequences with their associated frequency; discarding short fixedlength sequences with low frequency; and for each of the remaining shortfixed length sequences in the index, making a record of which ones arepresent, such that when the indexed short fixed length sequences arepresent, a conclusion is made that the reference subsequence is presentin the experimentally derived sequences; wherein one or more of theabove steps are performed in accordance with a processor and a memory.11. The method of claim 10, wherein the index comprises a matrix with arow for each short fixed length sequence in the collection of referencesubsequences, and a column for each reference sequence, and for eachcolumn, a first value is stored in the rows corresponding to the shortfixed length sequences that are present in the reference subsequence forthat column.
 12. The method of claim 11, wherein the short fixed lengthsequences are extracted from the experimentally derived sub-sequences.13. The method of claim 12, wherein for each short fixed lengthsequence, the corresponding row of the matrix is consulted and for eachreference subsequence corresponding to a column containing the firstvalue, the presence of the short fixed length sequence is recorded and,after processing the short fixed length sequences, any referencesubsequence for which its short fixed length sequences were observed isreported as being present in the experimentally derived data.